Award  Number:  DAMD17-98-1-8044 


AD 


TITLE:  A  Computer-Based  Decision  Support  System  for  Breast 

Cancer  Diagnosis 


PRINCIPAL  INVESTIGATOR:  Zuyi  Wang 


CONTRACTING  ORGANIZATION:  The  Catholic  University  of  America 

Washington,  DC  20064 


REPORT  DATE:  September  2001 


TYPE  OF  REPORT:  Annual  Summary 


PREPARED  FOR:  U.S.  Army  Medical  Research  and  Materiel  Command 
Fort  Detrick,  Maryland  21702-5012 


DISTRIBUTION  STATEMENT:  Approved  for  Public  Release; 

Distribution  Unlimited 


The  views,  opinions  and/or  findings  contained  in  this  report  are 
those of the author(s) and should not be construed as an official
Department  of  the  Army  position,  policy  or  decision  unless  so 
designated  by  other  documentation. 




REPORT  DOCUMENTATION  PAGE 

Form  Approved 

OMB No. 0704-0188

Public  reporting  burden  for  this  collection  of  information  is  estimated  to  average  1  hour  per  response,  including  the  time  for  reviewing  instructions,  searching  existing  data  sources,  gathering  and  maintaining 
the  data  needed,  and  completing  and  reviewing  this  collection  of  information.  Send  comments  regarding  this  burden  estimate  or  any  other  aspect  of  this  collection  of  information,  including  suggestions  for 
reducing  this  burden  to  Washington  Headquarters  Services,  Directorate  for  Information  Operations  and  Reports,  1215  Jefferson  Davis  Highway,  Suite  1204,  Arlington,  VA  22202-4302,  and  to  the  Office  of 
Management and Budget, Paperwork Reduction Project (0704-0188), Washington, DC 20503

1.  AGENCY  USE  ONLY  (Leave  blank) 

2.  REPORT  DATE 

September  2001 

3.  REPORT  TYPE  AND  DATES  COVERED 

Annual  Summary  (1  Sep  00-31  Aug  01) 

4.  TITLE  AND  SUBTITLE 

A Computer-Based Decision Support System for Breast Cancer Diagnosis

5.  FUNDING  NUMBERS 

DAMD17-98-1-8044 

6.  AUTHOR(S) 

Zuyi  Wang 

7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

The  Catholic  University  of  America 

Washington,  DC  20064 

8.  PERFORMING  ORGANIZATION 

REPORT  NUMBER 

E-Mail:  zwang@pluto.ee.cua.edu 

9.  SPONSORING  /  MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

U.S.  Army  Medical  Research  and  Materiel  Command 

Fort  Detrick,  Maryland  21702-5012 

10.  SPONSORING  /  MONITORING 

AGENCY  REPORT  NUMBER 

11.  SUPPLEMENTARY  NOTES 


12a.  DISTRIBUTION  /  AVAILABILITY  STATEMENT 

12b.  DISTRIBUTION  CODE 

Approved  for  Public  Release;  Distribution  Unlimited 

13.  ABSTRACT  (Maximum  200  Words) 


The goal of this project is to develop a decision support system for breast cancer diagnosis, treatment options, prognosis, and risk
prediction. The project focuses on advanced image pattern analysis in diagnostic imaging and on information
integration methodology for statistically analyzing the distinction between lesion-like normal sites and real lesion sites. The specific
aims of this research project are: (1) image pattern analysis of breast tissue in mammography, using both computational features
and BI-RADS features provided by radiologists, for the prediction of malignancy associated with masses; (2) development of visual
presentation methods for radiologists' use in the consultation system; and (3) a pre-clinical test through ROC analysis.
The clinical goal of this consultation system is to provide scientific tools that let doctors view electronic magnifications,
perform feature analysis of suspected mammographic patterns, access a large database to investigate clinically similar cases,
and visually inspect the features of a case against various statistical distributions using graphic displays. We have accomplished (1)
feature extraction, (2) feature database construction, (3) development of a visual explanation tool for data mining, (4) feature database
structure exploration, (5) feature ranking and selection, and (6) classification of mass and non-mass cases based on the selected features.


14.  SUBJECT  TERMS 

Breast  Cancer  Diagnosis,  Breast  Cancer  Patient  Database,  Decision  Support  System, 
Computer-Aided Diagnosis, Visual Data Exploration, Artificial Intelligence

15.  NUMBER  OF  PAGES 

31 

16.  PRICE  CODE 

17.  SECURITY  CLASSIFICATION 
OF  REPORT 

Unclassified 

18.  SECURITY  CLASSIFICATION 
OF  THIS  PAGE 

Unclassified 

19.  SECURITY  CLASSIFICATION 

OF  ABSTRACT 

Unclassified 

20.  LIMITATION  OF  ABSTRACT 

Unlimited 

NSN  7540-01-280-5500 


Standard  Form  298  (Rev.  2-89) 

Prescribed  by  ANSI  Std.  Z39-18 
298-102 


Table  of  Contents 


Cover . 1 

SF  298 . 2 

Table  of  Contents . 3 

Introduction . 4 

Body . 5 

Key  Research  Accomplishments . 12 

Reportable  Outcomes . 12 

Conclusions . 12 

References . 13 

Appendices . 14




A  Computer-Based  Decision  Support  System  for  Breast  Cancer  Diagnosis 


Introduction 

The goal of this pre-doctoral training project is to develop a decision support system for
breast cancer diagnosis, treatment options, prognosis, and risk prediction. The system is
intended to function as a consultation system for both doctors and patients. The project
focuses on advanced image pattern analysis in diagnostic imaging and on information
integration methodology for statistically analyzing the distinction between lesion-like
normal sites and real lesion sites. Based on our intensive observation and experimental
evidence, we believe this problem is better solved through a statistical approach. The
specific aims of this research project are: (1) image pattern analysis of breast tissue in
mammography, using both computational features and BI-RADS features provided by
radiologists, for the prediction of malignancy associated with masses; (2) development of
visual presentation methods for radiologists' use in the consultation system; and (3) a
pre-clinical test through ROC analysis. The clinical goal of this consultation system is to
provide scientific tools that let doctors view electronic magnifications, perform feature
analysis of suspected mammographic patterns, access a large database to investigate
clinically similar cases, and visually inspect the features of a case against various statistical
distributions using graphic displays. Over the whole period of this research, we have
accomplished (1) feature extraction, (2) feature database construction, (3) development of a
visual explanation tool for high-dimensional data mining, (4) feature database structure
exploration using the visual exploration tool, (5) feature ranking and selection based on that
exploration, and (6) design of a neural network classifier based on the features selected
through feature database structure analysis.

Overview of Training and Research Accomplishment

1.  Research  Skill  Training  and  Literature  Background  Preparation 

Throughout the doctoral training period, I developed research skills by working
with the mentors of this project, which made it possible for me to continue my research
and pursue the advanced degree. From the first programs for reading and processing
digital mammograms, to the selection of cases, and then to the understanding of the
fundamental engineering components essential to the research, my academic advisor
Dr. Yue Wang at The Catholic University of America and my mentors Dr. Shih-Chung Lo
and Dr. Matthew Freedman at Georgetown University Medical Center provided
tremendous help. After one year of research work, my insight into research approaches
and my problem-solving ability had gradually been established and improved. We often
discussed and reviewed the primary goal of the project during the research process, both
to keep my work headed in the right direction and to give me a global view of all the
components of CAD. They helped me write better programs for image processing and
discussed the intermediate results of my calculations with me for further research
planning. Compared with myself two years ago, before working on this project, I see a
big difference, and I am very grateful.

Under the guidance of Dr. Wang and Dr. Lo, searching and reading the literature
and books gave me a better and broader view of breast cancer and computer-assisted
diagnosis (CAD) system research. Reading engineering textbooks greatly strengthened
the fundamental knowledge that is critical to the project. The major books I have been
reading and using as standing references are Neural Networks: A Comprehensive
Foundation by Simon Haykin, An Introduction to Signal Detection and Estimation by
H. Vincent Poor, Elements of Information Theory by Thomas M. Cover, and Statistical
Analysis of Finite Mixture Distributions by D. M. Titterington, A. F. M. Smith, and
U. E. Makov, among others. After searching technical papers in several major engineering
journals, such as IEEE Transactions on Medical Imaging, IEEE Transactions on Neural
Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, and
Medical Physics, I collected almost one hundred relevant papers in order to get an
overview of the work done by other researchers in this particular area and to set a
starting point and direction for my own research. The more I read, the better I became at
understanding and judging other researchers' work. After two years of intensive literature
reading, I had not only learned many advanced engineering techniques but had also
gradually learned the scientific method of problem solving.

2.  Research  Accomplishments 

2.1  Clinical  Case  and  Feature  Database  Development 
2.1.1  Clinical  Case  Selection 

The first step in establishing a feature database, largely finished in the first
and second years, was case collection and selection, which is fundamental and
crucial for the subsequent research. To detect suspicious mass regions in
a mammogram, we have to find the major differences between mass and
non-mass regions, so both mass and non-mass case groups are needed for
comparison. The major source of mammogram samples that was accessible and
suitable for this project is ISIS at Georgetown University Medical
Center. The ISIS database was constructed from suspicious mass regions extracted
from mammograms by licensed radiologists and subsequently verified by biopsy.
From it we obtained 103 cases, of which 71 are mass cases and 32 are non-mass
cases. The non-mass cases were purposely selected from normal breast tissue regions
that resemble masses.

2.1.2  Image  Feature  Extraction 

After preparing the mammogram cases, the next important consideration was to
choose features that can distinguish mass from non-mass cases effectively
and with a high detection rate. Each image block was first processed by an enhanced
segmentation procedure to extract the exact position where a mass may be present. The
position of the segmented area then served as a reference for feature
calculation. Many features have been tested by other researchers for their
effectiveness in distinguishing mass from non-mass cases, and the results have been
presented in their recent papers. Based on our search and reading of the literature and
medical books, we initially chose nine features: eight texture features and one
shape feature.

The eight texture features were calculated from the spatial gray level dependence
matrix: energy, correlation, inertia, entropy, inverse difference moment, sum
average, sum entropy, and difference entropy. Texture features, at an appropriate
scale, can reveal fine texture differences in images that cannot be seen
by the human eye. Several research groups have examined their
effectiveness in improving CAD performance.
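
As an illustration only, the following minimal sketch shows how a spatial gray level dependence (co-occurrence) matrix and four of the eight listed features might be computed. The function names, the single horizontal pixel offset, the symmetric counting, and the base-2 logarithm are assumptions made for this example; the report does not specify these details, and correlation, sum average, sum entropy, and difference entropy are omitted for brevity.

```python
import numpy as np


def glcm(image, levels, dx=1, dy=0):
    """Normalized, symmetric gray level co-occurrence matrix for one offset.

    `image` holds integer gray levels in [0, levels). The offset (dx, dy) and
    the symmetric counting are illustrative assumptions, not the report's.
    """
    img = np.asarray(image, dtype=int)
    counts = np.zeros((levels, levels), dtype=float)
    rows, cols = img.shape
    for r in range(max(0, -dy), min(rows, rows - dy)):
        for c in range(max(0, -dx), min(cols, cols - dx)):
            i, j = img[r, c], img[r + dy, c + dx]
            counts[i, j] += 1.0
            counts[j, i] += 1.0  # count each pixel pair in both directions
    return counts / counts.sum()


def texture_features(p):
    """Four texture features of a normalized co-occurrence matrix `p`."""
    i, j = np.indices(p.shape)
    nonzero = p[p > 0]  # skip empty cells so log2 is defined
    return {
        "energy": float(np.sum(p ** 2)),
        "entropy": float(-np.sum(nonzero * np.log2(nonzero))),
        "inertia": float(np.sum((i - j) ** 2 * p)),
        "inverse_difference_moment": float(np.sum(p / (1.0 + (i - j) ** 2))),
    }


# A two-level checkerboard: every horizontal neighbor pair differs.
board = np.indices((4, 4)).sum(axis=0) % 2
print(texture_features(glcm(board, levels=2)))
```

A perfectly flat region gives energy 1 and entropy 0, while the checkerboard spreads the probability over two cells, which lowers the energy and raises the entropy and inertia.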

The shape feature, compactness, was used in a previous study to distinguish non-mass
cases from the whole case population. From observing hundreds of
mammograms, we found shape to be essential for detecting masses embedded among
mass-like normal breast tissue. Most mass cases are relatively well-defined
round objects, whereas the overall shape of dense normal breast tissue, such as
glandular elements and embedded blood vessels, is often slender rather than round.
A simple way to calculate compactness is to divide the area of the segmented
region by the square of the perimeter of its contour. For the compactness of a perfect
circle to equal one, as used here, this ratio must be scaled by 4π, i.e.,
compactness = 4πA/P², where A is the area and P is the perimeter. The closer the
object shape is to a circle, the closer the compactness is to one. The difficult,
unavoidable, and crucial first step in calculating compactness is to extract a
continuous contour so that a precise perimeter of the segmented area can be
computed. The difficulty comes from the irregularity of contour shapes combined with
the requirement of continuity; even the continuous contour extraction methods
proposed in some image processing books cannot cover all possibilities. If only a
discontinuous contour were needed, the problem would be easy, since a simple scan
of the image yields a list of contour pixel coordinates. However, a task that is easy
for a human is sometimes very challenging for a computer program. To surmount
this obstacle, we designed a universal contour extraction method that handles all
possible positional relationships between a pixel and its neighboring pixels,
including all kinds of intersections and branches of a contour. The basic strategy of
this universal method is to scan all neighboring pixels of each pixel, record all
branches around it, decide which branch is the right direction for obtaining a
continuous contour, and delete pixels already collected in the contour so that no
pixel is collected more than once. This method made it possible to calculate the
compactness precisely. In the following sections of this report, we discuss
experimental results showing that compactness played an important role in
distinguishing mass from non-mass cases.
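
To make the role of the ordered contour concrete, here is a small sketch of a compactness calculation. It assumes the contour is already available as an ordered list of (x, y) points, which is what a continuous contour extraction step would supply, and it uses the 4πA/P² normalization under which a circle scores one. The function names are hypothetical, and the report's own contour-following method is not reproduced here.

```python
import math


def polygon_area_perimeter(points):
    """Area (shoelace formula) and perimeter of a closed, ordered contour."""
    twice_area = 0.0
    perimeter = 0.0
    n = len(points)
    for k in range(n):
        x0, y0 = points[k]
        x1, y1 = points[(k + 1) % n]  # wrap around to close the contour
        twice_area += x0 * y1 - x1 * y0
        perimeter += math.hypot(x1 - x0, y1 - y0)
    return abs(twice_area) / 2.0, perimeter


def compactness(points):
    """4*pi*A / P**2: one for a circle, smaller for slender shapes."""
    area, perimeter = polygon_area_perimeter(points)
    return 4.0 * math.pi * area / perimeter ** 2


# A near-circle (360-sided polygon) versus a slender 20-by-1 strip.
circle = [(math.cos(math.radians(k)), math.sin(math.radians(k))) for k in range(360)]
strip = [(0, 0), (20, 0), (20, 1), (0, 1)]
print(round(compactness(circle), 4), round(compactness(strip), 4))
```

Both sums walk the points in order, which is why an unordered list of boundary pixels from a simple image scan is not sufficient for the perimeter.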

3.  Feature  Database  Structure  Exploration  and  Neural  Network  Classifier  Design 
3.1  Visual  Data  Explanation  and  Mining  Tool  Design 

Although some of the many approaches to CAD research have produced
sophisticated systems with reportedly impressive performance, several fundamental
issues remain unsolved. For example, receiver operating characteristic (ROC) analysis
can provide an overall performance evaluation, but it may not help improve each
individual component of a CAD system. Furthermore, since the machine observer and
the human observer may not detect the same set of masses, the black-box nature of most
CAD systems may prevent a natural on-line integration of human and machine
intelligence and hinder further upgrades of a CAD system. As a strategic move toward
improving CAD design and utility, we developed a visual data exploration and
mining tool. Our aims are to (1) provide a visual map of the feature database prior to
the knowledge encoding component, so as to evaluate and improve the pre-processing
and signature extraction; (2) use the resulting map to design an optimal classifier
best fitted to the particular database structure for knowledge encoding; and (3)
combine the map, the classifier output, the raw image, and the user interface to explore
and explain the whole decision-making process carried out by both the radiologist and
the CAD system.

3.1.1  Discriminative  Projection 

Dimension reduction is the first problem on which we spent great effort. There
are two major reasons for reducing dimension: (1) the demands of visualization
and (2) cluster separation. The high dimensionality of the feature dataset
(nine dimensions in this case) makes visual data mining difficult.
While it is possible to encode several more dimensions into a graph by using various
symbols and/or colors, the human perceptual system is not prepared to deal with more
than three dimensions simultaneously. Principal component analysis (PCA) is an
effective unsupervised method for dimensionality reduction. Using PCA,
we can find the orthogonal axes onto which the projections retain maximal
variance, and thus obtain a new, lower-dimensional representation of the set of
observed vectors in the space spanned by the principal component axes. However, on
examining its limitations, we found that PCA may not be suited to the role we expect
of it in discovering the structure of our feature data, since we not only have to capture
maximal information from the feature data but also need to cluster the data points to
identify the data territory in the space. Another concern is identifying which of the
calculated features are responsible for cluster separation and hence for mass
and non-mass classification. The limitation of PCA is that dimensions with large
variance but small cluster separability may play dominant roles in determining the
projections and thus mislead the dimensionality reduction when the purpose is
cluster separation.
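
The limitation described above can be reproduced with a short sketch. The code below is a generic eigen-decomposition PCA, not the report's software, and the synthetic two-feature data set is an assumption chosen so that the high-variance feature carries no class information.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project rows of X onto the top principal axes of the total covariance."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    axes = eigvecs[:, order]
    return centered @ axes, axes, eigvals[order]


# Synthetic data (assumed for illustration): feature 0 has large variance but
# no class information; feature 1 has small variance but separates the classes.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
X = np.column_stack([
    rng.normal(0.0, 10.0, 200),
    np.where(labels == 0, -1.0, 1.0) + rng.normal(0.0, 0.1, 200),
])
_, axes, _ = pca_project(X, n_components=1)
print(axes[:, 0])  # dominated by feature 0, the non-discriminative one
```

Because the first principal axis follows the high-variance feature, a one-dimensional PCA projection of this data would mix the two classes even though they are cleanly separable along the other feature.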

We therefore extend conventional PCA in a direction in which it can serve as a
discriminant criterion, so that clusters are separated and visualized.
While conventional PCA is an unsupervised method,
discriminative principal component analysis (DPCA) is a supervised method that is
applied when prior knowledge of class information is available. Given the
class information, a better way of finding directions for cluster separation
is to emphasize inter-cluster separation by using Fisher's scatter matrix instead of
the total covariance matrix used in conventional PCA. This is a discriminative
projection search,


$$ W = \arg\max_{W_0} \, \mathrm{Trace}\left( W_0^{T} S_w^{-1} S_b W_0 \right) $$

where S_w is the within-cluster scatter matrix, S_b is the between-cluster scatter matrix,
and W is the optimal projection matrix. We term this method discriminative principal
component analysis (DPCA).
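
A minimal sketch of this criterion follows, assuming the usual Fisher definitions of the within-cluster and between-cluster scatter matrices and taking the leading eigenvectors of S_w^{-1} S_b as the projection axes. The function names and the synthetic data are hypothetical, and this is not the report's implementation.

```python
import numpy as np


def scatter_matrices(X, y):
    """Within-cluster (Sw) and between-cluster (Sb) scatter matrices."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for label in np.unique(y):
        Xk = X[y == label]
        mean_k = Xk.mean(axis=0)
        Sw += (Xk - mean_k).T @ (Xk - mean_k)
        diff = (mean_k - overall_mean)[:, None]
        Sb += len(Xk) * (diff @ diff.T)
    return Sw, Sb


def dpca_axes(X, y, n_components=2):
    """Leading eigenvectors of Sw^{-1} Sb (assumes Sw is invertible)."""
    Sw, Sb = scatter_matrices(X, y)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvecs[:, order].real, eigvals[order].real


# Same kind of synthetic data as before: only feature 1 separates the classes.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
X = np.column_stack([
    rng.normal(0.0, 10.0, 200),
    np.where(labels == 0, -1.0, 1.0) + rng.normal(0.0, 0.1, 200),
])
axes, _ = dpca_axes(X, labels, n_components=1)
print(axes[:, 0])  # dominated by feature 1, the discriminative one
```

Note that with only two classes S_b has rank one, so only the first axis is determined by this criterion; how a second display axis is chosen in that case is not stated in the report.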


Fig. 1. 2-D projections of feature data using conventional PCA
and DPCA; * = mass, o = non-mass.

Fig. 1 shows how the projections from conventional PCA and discriminative PCA
differ in cluster separation. In the left panel, where conventional PCA is applied,
mass and non-mass data points are mixed together without revealing any cluster
structure, whereas in the right panel, produced by discriminative PCA, the mass
and non-mass data points form a clearer distribution structure.

3.1.2 Hierarchical Structure


According to Cover's theorem on the separability of patterns, when a data set
is linearly projected onto a single dimension-reduced subspace, its inherent
multimodal nature may be partially or completely obscured. The growing
volume of high-dimensional, multimodal data demands a data mining tool
that, unlike conventional data visualization methods, is capable of dealing
with high-dimensional data sets. This motivates our consideration of a hierarchical
visualization paradigm involving hierarchical statistical models and visualization
spaces. Comprehensive study of this issue suggested using several
complementary visualization subspaces to accomplish this complicated task. In this
algorithm, dimensionality reduction and cluster decomposition are the two major
components. Cluster decomposition permits the use of relatively simple models
for each local structure, offering great ease of interpretation as well as
analytical and computational simplification. Dimensionality reduction, in turn,
allows visual explanation of high-dimensional data and lowers the
computational demand. We proposed using standard finite normal mixtures (SFNM)
and hierarchical visualization spaces for effective data modeling and visualization.
The strategy is that the top-level model and projection should explain the whole
structure of the data set, while lower-level models explain the local, internal
structure of individual clusters, which may not be obvious in the higher-level models.
With many complementary mixture models and visualization projections, each level is
relatively simple, while the complete hierarchy maintains overall flexibility and still
conveys considerable cluster information. Fig. 2 shows an example of a hierarchical
visualization tree generated from a set of simulated data. The left panel is a top-level
projection of the data, in which only two clusters can be seen without color
information; the upper right panel is a second-level projection that provides different
views of the two sub-clusters selected in the top-level projection. At the second level,
two hidden clusters appear within sub-cluster #2 in a projection that differs from the
top view. This gives the user the opportunity to discover the true data structure and
makes further partitioning possible.


Fig.  2.  User  Interface  of  Visual  Data  Exploration  and  Mining  Tool 

3.1.3 User Interaction


User interaction with the algorithm is also an important issue. We have developed
a user-friendly graphical interface to facilitate data visualization, shown in
Fig. 2, which allows the user to select the initial centers of the data clusters. In our
experience, this greatly reduces both the computational complexity and the
likelihood of converging to a local optimum. It should be pointed out that although the
final SFNM model can be estimated, there may be multiple pathways to the cluster
decomposition. For example, in this case the user has the flexibility to select only
two clusters at the second level and then further split the "right" cluster, thus adopting
a three-level hierarchy. We believe that this user-driven nature of the current algorithm
is also highly appropriate for the visualization context.
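
The idea of seeding the mixture fit with user-selected centers can be sketched as follows. This is a plain expectation-maximization (EM) fit of a finite normal mixture in the projected 2-D space; the function name, the fixed iteration count, the small regularization term, and the synthetic data are assumptions for the example and are not taken from the report's software.

```python
import numpy as np


def em_gmm(X, init_centers, n_iter=100, reg=1e-6):
    """Fit a finite normal mixture by EM, starting from given cluster centers.

    `init_centers` plays the role of the centers a user would click in the
    projection view. Expects data with at least two dimensions.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    means = np.array(init_centers, dtype=float)
    k = len(means)
    # Start every component from the overall covariance and equal weights.
    covs = np.array([np.cov(X, rowvar=False) + reg * np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        log_r = np.empty((n, k))
        for j in range(k):
            diff = X - means[j]
            inv = np.linalg.inv(covs[j])
            _, logdet = np.linalg.slogdet(covs[j])
            maha = np.einsum("ni,ij,nj->n", diff, inv, diff)
            log_r[:, j] = np.log(weights[j]) - 0.5 * (maha + logdet + d * np.log(2 * np.pi))
        log_r -= log_r.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and covariances.
        nk = r.sum(axis=0)
        weights = nk / n
        means = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (r[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)
    return weights, means, covs, r


# Two simulated clusters and two rough, hand-picked starting centers.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (100, 2)), rng.normal(5.0, 0.5, (100, 2))])
weights, means, covs, resp = em_gmm(X, init_centers=[(1.0, 1.0), (4.0, 4.0)], n_iter=50)
print(np.round(means, 2))
```

In a hierarchical setting, the points assigned mainly to one component (large values in the corresponding column of `resp`) would be re-projected and fitted again at the next level.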


3.2  Feature  Database  Exploration  for  Feature  Selection  and  Classifier  Design 

As the primary goal in developing the visual explanation and mining tool, we
use it to reveal and explain the feature database structure for CAD design. We
try to make both the hidden data patterns and the neural network "black box" as
transparent as possible to users, such as radiologists and patients, through interactive
visual explanation.

3.2.1  Feature  Selection 

We ranked and selected the features responsible for differentiating mass
from non-mass cases. One advantage of doing so is that reducing the dimension of
the feature dataset reduces the computational load on the classifier. The performance
of the classifier may even improve if only the features with high discriminative
power are used and the non-discriminative features are discarded. Although the
simple method of selecting the best individual features without considering
dependence among dimensions may fail dramatically, it can still be worthwhile as a
first step. We applied our software to model the dataset with an SFNM distribution.
Based on the distribution model, we performed DPCA to determine the top
discriminative principal axes. The nine features were ranked by discriminative power,
from high to low: energy, sum entropy, compactness, inertia, sum average, entropy,
correlation, difference entropy, and inverse difference moment. This ranking is used
directly in the classifier design discussed in the following section.
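
The report does not spell out how the ranking is derived from the DPCA axes, so the sketch below substitutes a simpler, named technique: a per-feature Fisher ratio (between-class over within-class scatter), which corresponds to the "best individual feature" first step mentioned above. The function names, feature labels, and synthetic data are hypothetical.

```python
import numpy as np


def fisher_scores(X, y):
    """Per-feature ratio of between-class to within-class scatter."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xk = X[y == label]
        mean_k = Xk.mean(axis=0)
        between += len(Xk) * (mean_k - overall_mean) ** 2
        within += ((Xk - mean_k) ** 2).sum(axis=0)
    return between / within


def rank_features(X, y, names):
    """Feature names with scores, ordered from most to least discriminative."""
    scores = fisher_scores(X, y)
    order = np.argsort(scores)[::-1]
    return [(names[i], float(scores[i])) for i in order]


# Synthetic example: feature "a" separates the classes strongly, "c" weakly,
# and "b" not at all.
rng = np.random.default_rng(2)
labels = np.repeat([0, 1], 50)
X = np.column_stack([
    6.0 * labels + rng.normal(0.0, 1.0, 100),
    rng.normal(0.0, 1.0, 100),
    2.0 * labels + rng.normal(0.0, 1.0, 100),
])
for name, score in rank_features(X, labels, ["a", "b", "c"]):
    print(name, round(score, 3))
```

Because each feature is scored in isolation, this ranking ignores dependence among features, which is exactly the caveat raised in the paragraph above.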

3.2.2  Neural  Network  Classifier  Design 

In classifier selection and design, the feature database structure is the main
guidance we can depend on. All these approaches share one important goal: to
improve CAD performance in a rational way, so that we can explain how each
component of the CAD system was designed and why the integrated system works or
does not work, and can explain it to radiologists to obtain feedback on the
development. The process is thus fairly transparent to users. Based on the feature
ranking, we designed a multilayer backpropagation network classifier with one input
layer, two hidden layers, and one output layer. The number of inputs was reduced
from nine to three by using the top three features from the feature ranking. The
performance of the classifier was analyzed using receiver operating characteristic
(ROC) analysis, and the result was compared with that of the classifier using all
features as inputs. The comparison, shown in Fig. 3, indicates that the top three
features together can represent the full feature dataset for classification purposes,
and that the performance is in fact better. The Az value of the classifier with three
inputs is 0.78; for the classifier using all features it is 0.68. Although the results are
still not very promising, this design approach gives us a much better understanding of
how the feature database can be used for classifier design. The success of feature
ranking and selection in classifier design is reflected not only in the lower
computational cost achieved through dimension reduction, but also in the greater
discriminative power of the combination of top-ranked features in classification.

Fig. 3. ROC analysis of neural network classifier
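
For readers unfamiliar with the area-under-the-ROC-curve measure, the sketch below computes a nonparametric estimate from classifier output scores: the fraction of (mass, non-mass) pairs in which the mass case scores higher, with ties counted as one half. This is an assumption-laden stand-in; Az values such as those quoted above are often obtained from a fitted binormal ROC curve, and the report does not say which estimator was used. The function name and the scores are hypothetical.

```python
def roc_area(scores_positive, scores_negative):
    """Nonparametric ROC area: P(positive score > negative score), ties = 1/2."""
    wins = 0.0
    for p in scores_positive:
        for q in scores_negative:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Hypothetical classifier outputs for three mass and two non-mass cases.
mass_scores = [0.9, 0.8, 0.4]
non_mass_scores = [0.7, 0.3]
print(roc_area(mass_scores, non_mass_scores))  # 5 of the 6 pairs are ordered correctly
```

An area of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives a scale for reading the 0.68 and 0.78 values reported above.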

Key Research Accomplishments

•  Improving  research  skill  and  enhancing  fundamental  engineering  knowledge  through 
book  and  literature  searching  and  reading  under  guidance  of  advisor  and  mentors. 

•  Collecting  image  cases,  processing  images  by  computing  image  features  and  constructing 
high  dimensional  image  feature  database. 

•  Developing and improving the visual data explanation and mining tool, and exploring
feature database structure for feature selection and classifier design, to make the CAD
design process effective and reasonable.

•  Designing  classifier  and  improving  classifier  performance  based  on  data  structure 
exploration  and  the  study  of  features. 


Reportable  Outcomes 

•  Y. Wang, Z. Wang, L. Luo, S-H. B. Lo and M. T. Freedman, "Computer-Based Decision
Support System: Visual Mapping of Featured Database in Computer-Aided Diagnosis",
Proc. of SPIE, Image Processing, Vol. 1, No. 24, pp. 136-147, February 2000.

•  Feature  extraction  programs. 

•  Visual  data  exploration  and  mining  tool  software. 

•  J. Lu, Y. Wang, Z. Wang, et al., "Discriminative Mining of Gene Microarray Data",
Proc. of Neural Networks for Signal Processing, pp. 23-32, September 2001.


Conclusions 

In this project, we devoted our efforts to developing effective feature extraction methods,
constructing a feature database, and developing a visual explanation tool for data mining
and knowledge discovery that is both statistically principled and visually effective. This
method, as illustrated by simulations and pilot applications in computer-aided
diagnosis, is capable of revealing hidden structure within data. It is
important to emphasize that, in relation to previous work, one interesting feature
of the present algorithm is that the models are determined by information theoretic
criteria, which not only select the most appropriate model structure but
also allow a user-driven choice as a double check. This approach promotes a
self-consistent fitting of the whole tree, so that an automated procedure for generating
the hierarchy becomes feasible. In addition, since we perform model selection and
parameter initialization first over the projection space, the computational complexity
is greatly reduced compared with maximum likelihood estimation in the full dimension.
Another possible advantage is that the data projection is determined by maximizing the
separation of clusters, which in turn benefits other crucial operations such as model
selection and parameter initialization and supports the hypothesis-driven nature of
the data projection. Using the visual explanation tool, we explored the feature
database structure for feature selection and classifier design. The performance of the
classifier showed that feature selection based on feature ranking makes it
possible to reduce dimensionality in classifier design as well as in the visual data
exploration software.
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,  ABSTRACT 

As a strategic move toward improving the utility of computer-aided diagnosis (CAD) in breast cancer detection, this
work aims to develop a computer-based decision support system that, through a visual mapping of the featured database,
explains the entire decision making process jointly through the computer-encoded knowledge and the user interaction.
The main purpose of the work is twofold: to enhance the clinical utility of CAD and to provide a mechanism for optimal
system design. We adopt a mathematical feature extraction procedure to construct the featured database from
the suspicious mass sites localized by the enhanced segmentation. The optimal mapping of the data points is then
obtained by learning hierarchical normal mixtures and the associated decision boundaries. A visual explanation of
the decision making is further devised through a multivariate data mining and knowledge discovery scheme. In
particular, using multiple finite normal mixture models and hierarchical visualization spaces, the new strategy is that
the top-level model and projection should explain the entire data set, best revealing the presence of clusters and
relationships, while lower-level models and projections should display internal structure within individual clusters,
such as the presence of subclusters, which might not be apparent in the higher-level models and projections. We
demonstrate the principle of the approach on several multimodal numerical data sets, and we then apply the method
to the visual explanation in CAD for breast cancer detection from digital mammograms.

1.  INTRODUCTION 

In order to improve mass detection and classification in clinical screening and/or diagnosis of breast cancers, many
sophisticated computer-assisted diagnosis (CAD) systems have been recently developed. Although the clinical roles
of the CAD systems may still be debatable, the fundamental role should be complementary to the radiologists'
clinical duties or for automated high-risk population screening. A literature survey indicates that (1) most CAD
systems are “black boxes” to the users and (2) there is no working link between “evaluation” and improvement. This
paper addresses the further development of CAD for mass detection based on (1) construction of a featured knowledge
database; (2) mapping of classified and unclassified data points; and (3) development of a visual exploration and
explanation interface.

Although many previously proposed approaches have led to impressive results, several fundamental issues remain
unresolved. For example, while Receiver Operating Characteristic (ROC) analysis can provide an overall performance
evaluation, it may not help improve each of the multiple components in a CAD system. Furthermore, since
the machine observer and human observer may not detect the same set of masses, the “black box” nature of most
CAD systems may prevent a natural on-line integration of human intelligence and further upgrade of a CAD system.
Our effort is to: (1) provide a visual map of the featured database before the knowledge encoding component so as to
evaluate and improve the pre-processing and signature extraction; (2) design, based on the map, an optimal classifier
fitted to this particular database structure for knowledge encoding; and (3) combine the map, the classifier output,
the raw image, and the user interface to explore and explain the whole decision making process by both radiologist
and CAD systems.

Further author information: Send correspondence to Y. Wang (E-mail wang@pluto.ee.cua.edu).


2. BACKGROUND

As the first step toward understanding multivariate data sets, cluster information reveals insight that may prove
useful in knowledge discovery, since the growing volume of complex data is often high dimensional
and lacking in prior knowledge.4,6,9 Several new visualization methods have been progressively developed to model
and display the contents of the data sets.4,6,9,11,14 However, although such algorithms can usefully characterize
the content of simple data sets, little comprehensive study has been reported that proves adequate in the face of
multimodal and high dimensional data sets.4,9,14 For example, a single projection of the data onto a visualization
space may not be able to capture all of the interesting aspects of the data set. This motivates the consideration of a
hierarchical visualization paradigm involving hierarchical statistical models and visualization spaces.

Once we explore the possibility of using many complementary visualization subspaces, cluster decomposition
and dimensionality reduction are the two major steps. Cluster decomposition permits the use of relatively simple
models for each of the local structures, offering greater ease of interpretation as well as the benefits of analytical and
computational simplification. On the other hand, dimensionality reduction allows better visual interpretation and less
computational demand. Many researchers have recently proposed various methods to improve data visualization.6,9
The work most closely related to our methodology was reported by Bishop and Tipping.4,12 They introduce a
hierarchical modeling and visualization algorithm based on a two-dimensional hierarchical mixture of latent variable
models, whose parameters are estimated using the expectation-maximization (EM) algorithm.4,19 The construction
of the hierarchical tree proceeds top down, in which the cluster decomposition is driven interactively by the user and
the optimal projection is determined by the maximum likelihood principle.

In  this  paper,  we  propose  using  standard  finite  normal  mixtures  (SFNM)  and  hierarchical  visualization  spaces  for 
an  effective  data  modeling  and  visualization.  The  strategy  is  that  the  top-level  model  and  projection  should  explain 
the  entire  data  set,  best  revealing  the  presence  of  clusters  and  relationships,  while  lower-level  models  and  projections 
should  display  internal  structure  within  individual  clusters,  such  as  the  presence  of  subclusters,  which  might  not  be 
apparent  in  the  higher-level  models  and  projections.  With  many  complementary  mixture  models  and  visualization 
projections,  each  level  will  be  relatively  simple  while  the  complete  hierarchy  maintains  overall  flexibility  yet  still 
conveys  considerable  cluster  information.  Based  on  the  concept  of  combining  finite  mixture  modeling19  and  principal 
component  projection4-14  to  guide  cluster  decomposition  and  dimensionality  reduction,  the  particular  advantages  of 
our  algorithm  are: 


1. At each level, a probabilistic principal component extraction is performed to project the softly partitioned data
set down to a two-dimensional visualization space, leading to an effective dimensionality reduction and allowing
effective separation and visualization of local clusters4,8,15;

2. Learning from the data directly, information theoretic criteria are used to select model structures and estimate
their parameter values, where the soft partitioning of the data set results in a standard finite normal mixture
distribution best fitted to the data7,21,25;

3. By alternately performing principal component projection and finite mixture modeling, a complete hierarchy
of complementary projections and refined models can be generated automatically, allowing a new paradigm of
knowledge discovery.4,6,9


3.  THEORY  AND  METHOD 

One of the difficulties inherent in data visualization is the problem of visualizing multi-dimensionality.4,6,9 When
there are more than three variables, it stretches the imagination to visualize their relationships. Fortunately, in data
sets with many variables, groups of variables often form clusters.13,15,16 Thus, our approach includes two major
complementary components: (1) dimensionality reduction by probabilistic principal component projection and (2)
cluster decomposition by adaptive soft data clustering.

Assume the data points $\{t_i\}$ in the data space come from $K_0$ clusters $\{\theta_{t1}, \ldots, \theta_{tk}, \ldots, \theta_{tK_0}\}$, where $\theta_{tk}$ is the
Gaussian kernel parameter vector of cluster $k$ in the model. Recently there has been considerable success in using
the SFNM to model the distribution of a multimodal data set,4,7,10,19,26 such that the data distribution takes a sum
of the following general form:

$$p(t_i) = \sum_{k=1}^{K_0} \pi_k \, g(t_i \mid \theta_{tk}) \tag{1}$$

where $\pi_k$ is the corresponding mixing proportion, with $0 < \pi_k < 1$ and $\sum_{k=1}^{K_0} \pi_k = 1$, and $g$ is the Gaussian kernel.

The problem of SFNM modeling addresses the combined estimation of regional parameters $(\pi_k, \theta_{tk})$ and detection
of structural parameter $K_0$ in Eq. (1) based on the observations $t$. One natural criterion used for estimating
the parameter values is to minimize the distance between the SFNM distribution $f(t)$ and the data histogram $f_t$.
Suggested by information theory,19,20 relative entropy (Kullback-Leibler distance) is a suitable measure, given by

$$D(f_t \,\|\, f) = \sum_{t} f_t(t) \log \frac{f_t(t)}{f(t)}. \tag{2}$$

We have previously shown that distance minimization based on (2) is equivalent to the maximum likelihood (ML)
estimation under a data independency approximation,7 and when $K_0$ is given, the ML estimate of the regional
parameters can be obtained using the EM algorithm.15,19,26
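
As a minimal sketch of the criterion in Eq. (2), the following evaluates the relative entropy between a data histogram and a one-dimensional SFNM. The bin count, the discretization of the model at bin centres, and all function names are assumptions made for illustration; the text does not specify how the histogram is formed.

```python
import math

def gaussian_pdf(t, mean, var):
    """Univariate Gaussian kernel g(t | mean, var)."""
    return math.exp(-0.5 * (t - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def sfnm_pdf(t, weights, means, variances):
    """SFNM density in the form of Eq. (1): sum_k pi_k * g(t | theta_k)."""
    return sum(w * gaussian_pdf(t, m, v) for w, m, v in zip(weights, means, variances))

def relative_entropy(samples, weights, means, variances, bins=20):
    """Relative entropy D(f_t || f) of Eq. (2) between a histogram and an SFNM.

    Assumption: the model density is sampled at the bin centres and
    renormalized so that both distributions sum to one over the same bins.
    """
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins
    counts = [0] * bins
    for s in samples:
        idx = min(int((s - lo) / width), bins - 1)  # clamp the right edge into the last bin
        counts[idx] += 1
    hist = [c / len(samples) for c in counts]
    centres = [lo + (b + 0.5) * width for b in range(bins)]
    model = [sfnm_pdf(c, weights, means, variances) for c in centres]
    total = sum(model)
    model = [m / total for m in model]
    # Empty histogram bins contribute zero to the sum.
    return sum(h * math.log(h / m) for h, m in zip(hist, model) if h > 0.0)
```

A smaller value indicates a closer fit, so candidate parameter sets can be compared by calling `relative_entropy` on the same samples.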

There are three major problems associated with the current approach. First, when the dimension of the data
space is high, the computational complexity of implementing the EM algorithm in t-space is very high. Second, the
initialization of the EM algorithm is often heuristically chosen, which may lead to both local optima and additional
computational complexity. Finally, since the number of the local clusters in a particular data set is generally unknown,
model selection is a prerequisite. A natural way, with greater practical applicability, to tackle these problems is to
introduce user interaction with the system.4,9 Data mining and knowledge discovery are not processes that can be
orchestrated a priori. Training algorithms and expected behavior can be specified, but the actual learning must follow
for insight and spontaneous inspiration.9 For example, by examining plots of principal component space, researchers
often develop a deeper understanding of the driving forces that generated the original data, and effortlessly grasp
the general characteristics of the data and propose an initial solution.4,6,9

Principal component analysis (PCA) is an effective method for achieving dimensionality reduction.11,12 For
a set of observed $d$-dimensional data vectors $\{t_i\}$, $i \in \{1, \ldots, N\}$, the $q$ principal axes $w_m$, $m \in \{1, \ldots, q\}$, are
those orthogonal axes onto which the retained variance under projection is maximal. It can be shown that the
principal axes $w_m$ are given by the $q$ dominant eigenvectors (i.e., maximal eigenvalues) of the sample covariance
matrix $C_t = \sum_i (t_i - \mu_t)(t_i - \mu_t)^T / N$ such that $C_t w_m = \lambda_m w_m$, where $\mu_t$ is the sample mean. The vector
$x_i = W^T (t_i - \mu_t)$, where $W = (w_1, w_2, \ldots, w_q)$, is thus a $q$-dimensional reduced representation of the observed
vector $t_i$. The advantage of PCA is twofold: the projection onto the principal subspace (1) minimizes the squared
reconstruction error12,15 and (2) maximizes the separation of data clusters.16 Although the effectiveness of applying
PCA in an unsupervised manner is highly data-dependent, our approach has a simple optimal appeal in that if the
local clusters are linearly separable in a two- or three-dimensional space, the principal component projections allow
best separation of the clusters.16
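
The projection described above can be sketched in a few lines, assuming NumPy is available. The function name `pca_project` is hypothetical; the covariance is divided by $N$, following the definition of $C_t$ in the text.

```python
import numpy as np

def pca_project(T, q=2):
    """Project the rows of T (N x d) onto the q dominant eigenvectors of C_t."""
    mu_t = T.mean(axis=0)                       # sample mean
    centered = T - mu_t
    C_t = centered.T @ centered / T.shape[0]    # sample covariance, divided by N
    eigvals, eigvecs = np.linalg.eigh(C_t)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:q]       # indices of the q largest eigenvalues
    W = eigvecs[:, order]                       # d x q matrix of principal axes w_m
    X = centered @ W                            # each row is x_i = W^T (t_i - mu_t)
    return X, W, mu_t, eigvals[order]
```

The variance of each projected coordinate equals the corresponding eigenvalue, which gives a quick way to check how much variance a two-dimensional view retains.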

Suppose the data space is $d$-dimensional. Now consider a two-dimensional projection space $x = (x_1, x_2)^T$ together
with a linear transformation that maps the data space to the projection space by $x = W^T (t - \mu_t)$, where $W$ is a
$d \times 2$ matrix. For a normal distribution $p(t)$ over the data space, using the rules of probability, a similar reduced
dimension probability distribution of the new variables $\{x_i\}$ in the projection space is obtained from the convolution
of the projection model with the true distribution over data space in the form of $f(x) = \int p(x \mid t) p(t) \, dt$.4,12,17 Since
the conditional distribution is $p(x \mid t) = \delta(x - W^T t + W^T \mu_t)$, where $\delta(\cdot)$ is the delta function with $\delta(0) = 1$ and
$\delta(\neq 0) = 0$, it can be shown that $f(x)$ is simply defined by the Radon transform of $p(t)$, i.e.,
$f(x) = \int p(t) \, \delta(x - W^T t + W^T \mu_t) \, dt$.18 According to the linear superposition property of the Radon transform and the
projection invariant property of the normal distribution, if $p(t)$ is a SFNM distribution, the data distribution in the
projection space has a similar reduced dimension form as Eq. (1):

$$f(x_i) = \sum_{k=1}^{K_0} \pi_k \, g(x_i \mid \theta_{xk}) \tag{3}$$
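
The projection invariance used here rests on a standard fact: a linear map of a Gaussian is again Gaussian, with mean $W^T(\mu_{tk} - \mu_t)$ and covariance $W^T C_{tk} W$. A minimal sketch, assuming NumPy and a hypothetical helper name, maps one kernel from t-space to x-space; the mixing proportion is unchanged by the projection.

```python
import numpy as np

def project_component(mu_tk, C_tk, W, mu_t):
    """Map one Gaussian kernel from t-space to x-space under x = W^T (t - mu_t)."""
    mu_xk = W.T @ (mu_tk - mu_t)   # projected mean
    C_xk = W.T @ C_tk @ W          # projected covariance
    return mu_xk, C_xk
```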


However,  because  of  its  global  linearity,  the  application  of  PCA  is  necessarily  somewhat  limited.12,13  For 
example,  the  inherent  multimodal  nature  of  the  data  set  may  be  completely  obscured  when  it  is  projected  onto  the 
lower  dimensional  principal  subspace.  Thus,  it  is  important  to  note  that  although  the  cluster  structure  of  the  data  set 
may  be  evident  from  the  higher  dimensional  plot  of  the  raw  data,  it  is  quite  conceivable  to  have  the  intrinsic  cluster 
structure  of  the  data  concealed  after  a  projection  in  the  more  general  case  of  high-dimensional  data  sets.15  An 




alternative paradigm is to model a multimodal data set with a collection of local linear subspaces through probabilistic
principal component analysis, as shown in Fig. 1.12,14 The method is a two-stage procedure: a soft partitioning of
the data space followed by estimation of the principal subspace within each partition. For the sake of computational
simplicity, it is reasonable to consider the model parameter values being estimated first in the projection space and
then further fine tuned in the data space.14

The association of a SFNM distribution with PCA offers the possibility of being able to visualize complex data
structures through a mixture of probabilistic principal component subspaces. By a simple extension of the maximum
a posteriori rule for data classification in the standard $K_0$-ary Bayes hypothesis testing,15,20 we can obtain a principal
component projection along the desired axes onto which a particular portion of the data set is highlighted, by
weighting all of the data points in the whole data set with their posterior probabilities of belonging to that portion.
This involves a soft clustering of the data points in which, instead of any given data point being assigned exclusively
to one principal component subspace, the responsibility for its generation is shared among all of the subspaces.

Under the SFNM model defined by Eq. (1), the posterior Bayesian probability $z_{ik}$ of a given data point $t_i$
belonging to cluster $k$ is

$$z_{ik} = \frac{\pi_k \, g(t_i \mid \theta_{tk})}{p(t_i)}$$

where $k = 1, 2, \ldots, K_0$ and $\sum_k z_{ik} = 1$. These posterior probabilities, together with the computational simplicity
of performing PCA (involving no more than finding the top $q$ eigenvectors of the covariance matrix of the data
points), make it a good candidate for the linear subspace in the mixture. The $q$ principal components define the
local subspace assumed for the multimodal data. The contributions of the input to the $k$ subspace are the activities of
the weighted data points $\{t_{ik}\}$ for input cluster $k$. This can be obtained by $t_{ik} = z_{ik}(t_i - \mu_{tk})$, where $\mu_{tk}$ is the
weighted sample mean of cluster $k$:

$$\mu_{tk} = \frac{\sum_i z_{ik} \, t_i}{\sum_i z_{ik}} \tag{4}$$

and $C_{tk}$ is the corresponding weighted sample covariance matrix:

$$C_{tk} = \frac{\sum_i z_{ik} (t_i - \mu_{tk})(t_i - \mu_{tk})^T}{\sum_i z_{ik}} \tag{5}$$
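
The soft clustering step can be sketched as follows, assuming NumPy. The helper names are hypothetical, and the weighted mean and covariance are written in the usual posterior-weighted form, which is an assumption where the printed equations are not fully legible.

```python
import numpy as np

def gaussian_pdf(T, mu, C):
    """Multivariate Gaussian kernel g(t | mu, C) evaluated at each row of T."""
    d = T.shape[1]
    diff = T - mu
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(C), diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(C))
    return np.exp(-0.5 * maha) / norm

def posteriors(T, weights, means, covs):
    """z_ik = pi_k g(t_i | theta_tk) / p(t_i); returns an N x K matrix."""
    joint = np.column_stack(
        [w * gaussian_pdf(T, m, C) for w, m, C in zip(weights, means, covs)]
    )
    return joint / joint.sum(axis=1, keepdims=True)

def local_axes(T, z_k, q=2):
    """Weighted mean, weighted covariance and top-q axes W_k for one cluster."""
    mu_k = (z_k[:, None] * T).sum(axis=0) / z_k.sum()
    diff = T - mu_k
    C_k = (z_k[:, None] * diff).T @ diff / z_k.sum()
    eigvals, eigvecs = np.linalg.eigh(C_k)
    W_k = eigvecs[:, np.argsort(eigvals)[::-1][:q]]
    X_k = (z_k[:, None] * diff) @ W_k   # cluster-focused coordinates W_k^T t_ik
    return mu_k, C_k, W_k, X_k
```

Points with a small posterior for cluster $k$ are pulled toward the origin of that cluster's subspace, so each plot emphasizes the points its own cluster is responsible for.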


The subspaces for the focused clusters are generated by a localized linear PCA such that $C_{tk} w_{mk} = \lambda_{mk} w_{mk}$. It
is important to understand that each component in Eq. (1) now corresponds to an independent subspace model with
parameters $\theta_{xk}$ and $W_k$, where $W_k = (w_{1k}, w_{2k}, \ldots, w_{qk})$. More precisely, consider the vector $x_{ik} = W_k^T t_{ik}$
to be a $q$-dimensional reduced representation of the $k$-cluster focused vector $t_{ik}$; the corresponding probability
distribution is defined by

$$g(x \mid W_k, \theta_{xk}) = \int g(t \mid \theta_{tk}) \, \delta(x - W_k^T t + W_k^T \mu_{tk}) \, dt \tag{6}$$

where the data mapping by $W_k$ leads to an independent Radon transform. To interpret the corresponding set
of visualization subspaces, it may be useful to plot all of the data points on every plot. For this, we may create
a $k$-cluster focused projection in $k$-subspace by plotting the vector $x_{ik}$, or display the density of “gray-level” in
proportion to the contribution which each point has for $k$-subspace with $z_{ik}[W_k^T(t_i - \mu_{tk})]$.

An important issue concerning unsupervised cluster decomposition is the detection of the structural parameter $K_0$,
called model selection.7,14,15,19,25 This is indeed particularly critical in real-world applications where the structure of
the data patterns may be arbitrarily complex.5 We propose to use two information theoretic criteria, i.e., the Akaike
information criterion (AIC)21 and minimum description length (MDL),22 to guide model selection. The major thrust
of this approach has been the formulation of a model fitting procedure in which an optimal model is selected from
the several competing candidates such that the selected model best fits the observed data, under Jaynes' minimax
entropy principle stated as “the parameters in a model which determine the value of the maximum entropy should
be assigned values which minimize the maximum entropy”.23,24 For example, AIC tries to reformulate the problem
explicitly as an approximation of the true structure by the model, implying that AIC will select the model that gives
the minimum value defined by

$$\mathrm{AIC}(K_a) = -2 \log(L_{ML}) + 2 K_a \tag{7}$$

where $L_{ML}$ is the maximum likelihood of the model and $K_a$ is the number of free adjustable parameters in the model.
From a quite different point of view, MDL reformulates the problem explicitly as an information coding problem in
which the best model fit is measured such that it assigns high probabilities to the observed data while at the same
time the model itself is not too complex to describe.22 A model is selected by minimizing the total description
length defined by

$$\mathrm{MDL}(K_a) = -\log(L_{ML}) + 0.5 \, K_a \log N \tag{8}$$

where the penalty term in MDL takes into account the number of observations $N$. It should be pointed out that when
the cluster separability is poor, the performance of these two information theoretic criteria may not be reliable.21,25
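
The two criteria are simple to evaluate once the maximized log-likelihood of each candidate model is known. The sketch below uses the free-parameter count for a two-dimensional SFNM stated later in the text (two means, two variances, and one correlation coefficient per kernel, plus the mixing factors); the function names and the dictionary input format are assumptions for illustration.

```python
import math

def free_parameters(K0):
    """K_a for a 2-D SFNM with K0 kernels: 5 kernel parameters each, plus K0 - 1 mixing factors."""
    return 6 * K0 - 1

def aic(log_likelihood, K0):
    """Akaike information criterion, Eq. (7)."""
    return -2.0 * log_likelihood + 2.0 * free_parameters(K0)

def mdl(log_likelihood, K0, N):
    """Minimum description length, using the penalty 0.5 * K_a * log N."""
    return -log_likelihood + 0.5 * free_parameters(K0) * math.log(N)

def select_order(log_likelihoods, N):
    """Return the K minimizing AIC and the K minimizing MDL.

    log_likelihoods maps each candidate K to its maximized log-likelihood.
    """
    best_aic = min(log_likelihoods, key=lambda K: aic(log_likelihoods[K], K))
    best_mdl = min(log_likelihoods, key=lambda K: mdl(log_likelihoods[K], K, N))
    return best_aic, best_mdl
```

Because the MDL penalty grows with $\log N$, it tends to favor smaller models than AIC on large data sets, which is one reason to inspect both.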

As discussed above, the SFNM model identification is first performed over x-space. However, a mapping from
t-space to x-space may have the intrinsic cluster structure concealed, leading to an incorrect correspondence between
Eq. (1) and Eq. (3). We now extend the mixture representation of Eq. (1) to form a hierarchical mixture model
general enough to be applicable to mixtures of any parametric density model. Based on the discussion of a two-level
system consisting of a single Radon transform at the top level and a mixture of $K_0$ normal distributions at the second
level, we can reformulate the hierarchy to a third level by associating a group $Q_k$ of SFNM models with each model
$k$ in the second level, given by

$$p(t) = \sum_{k=1}^{K_0} \pi_k \sum_{j=1}^{L_{k,0}} \pi_{j|k} \, g(t \mid \theta_{t(k,j)}) \tag{9}$$

where $\pi_{j|k}$ again correspond to a set of mixing proportions, one for each $k$, with $\sum_j \pi_{j|k} = 1$. The formation of the
hierarchy is guided by the model selection over x-subspaces, where each level of the hierarchy corresponds to a generic
model, with lower levels giving more focused and interpretable representations. Once again each component in Eq.
(9) now corresponds to an independent subspace model with Radon transform
$g(x \mid \theta_{x(k,j)}) = \int g(t \mid \theta_{t(k,j)}) \, \delta(x - W_{(k,j)}^T t + W_{(k,j)}^T \mu_{t(k,j)}) \, dt$.


4.  ALGORITHMS 

Based on the theory behind hierarchical mixtures of probabilistic principal component subspaces discussed
above, we now describe our algorithm and the major steps of the visual hierarchy construction.
Although the tree structure of the hierarchy may be empirically defined,4,12 a more interesting effort is to build the
tree automatically and interactively. Guided by the two information theoretic criteria, our algorithm progressively
proceeds by fitting a series of submodels to the clusters of the data set, in which model order is selected automatically
and algorithm initialization is driven interactively. A schematic summary of the algorithm is as follows:

1. Project the data set onto a single x-space, in which $W$ is determined from the sample covariance matrix $C_t$
by fitting a single Gaussian model to the data set over t-space.

2. Learn $f(x)$ for $K = K_{\min}, \ldots, K_{\max}$, in which the values of $\pi_k$ and $\theta_{xk}$ are initialized by the user and
estimated by the EM algorithm over x-space.

3. Calculate the values of AIC and MDL for $K = K_{\min}, \ldots, K_{\max}$, and select a model with $K_0$ which corresponds
to the minimum of AIC and MDL. The model parameters obtained in x-space will be used to initialize the
model parameters in t-space for the learning in step 4.

4. Learn $f(t)$ with $K_0$, in which the values of $\pi_k$, $z_{ik}$, $\mu_{tk}$, and $C_{tk}$ are fine tuned by the EM algorithm over
t-space.

5. Determine $W_k$ from $t_{ik}$ or $C_{tk}$, and plot $x_{ik}$ or $z_{ik}[W_k^T(t_i - \mu_{tk})]$ onto x-subspaces at the second level for visual
evaluation, for $k = 1, 2, \ldots, K_0$.

6. Learn $Q_k(t)$ by repeating steps 2-4 and construct x-subspaces at the third level by repeating step 5, for
$k = 1, 2, \ldots, K_0$.

7. Complete the whole hierarchy under the information theoretic criteria, and plot all x-subspaces for visual
exploration and explanation.

Our algorithm begins by determining $W$ for the top level projection. For low dimensional data sets, we directly
evaluate the covariance matrix $C_t$ to find $W$.13,15 For high dimensional cases, since only the top two eigenvectors
of the covariance matrix of the data points are of interest, it may be computationally more efficient to apply our
previously developed APEX neural networks8 to find $W$ directly from the data points $t_i$ (Step 1). On the basis of
this single x-space, given a fixed $K$, the user then selects $(K_{\min}, K_{\max})$ and points $\mu_{xk}$ on the plot corresponding
to the centers of apparent clusters. The EM algorithm can be applied to allow a SFNM (Eq. (3)) to be fitted to the
projected data through the following two-stage19,26 form:


E-Step

$$z_{ik}^{(n)} = \frac{\pi_k^{(n)} \, g(x_i \mid \theta_{xk}^{(n)})}{f(x_i \mid \pi^{(n)}, \theta_x^{(n)})} \tag{10}$$

M-Step

$$\pi_k^{(n+1)} = \frac{1}{N} \sum_{i=1}^{N} z_{ik}^{(n)}, \qquad
\mu_{xk}^{(n+1)} = \frac{\sum_{i=1}^{N} z_{ik}^{(n)} x_i}{\sum_{i=1}^{N} z_{ik}^{(n)}}, \qquad
C_{xk}^{(n+1)} = \frac{\sum_{i=1}^{N} z_{ik}^{(n)} (x_i - \mu_{xk}^{(n+1)})(x_i - \mu_{xk}^{(n+1)})^T}{\sum_{i=1}^{N} z_{ik}^{(n)}} \tag{11}$$

where at each complete cycle of the algorithm, we first use the “old” set of parameter values to determine the posterior
probabilities using Eq. (10). These posterior probabilities are then used to obtain “new” values $\pi_k^{(n+1)}$, $\mu_{xk}^{(n+1)}$,
and $C_{xk}^{(n+1)}$ using Eqs. (11). The algorithm cycles back and forth until the value of relative entropy (Eq. (2)) reaches
its minimum (Step 2). It can be shown that, at each stage of the EM algorithm, the relative entropy decreases unless
it is already at a local minimum.19 The model selection procedure will then determine the optimal number $K_0$ of
models to fit at the next level down using the two information theoretic criteria, where $K_a = 6K_0 - 1$, including $2K_0$
means, $2K_0$ variances, $K_0$ correlation coefficients, and $K_0 - 1$ mixing factors (Step 3). The resulting points $\mu_{tk}$ in
data space, obtained by $\mu_{tk} = W \mu_{xk} + \mu_t$, are then used as the initial means of the respective submodels. Since
the mixing proportions $\pi_k$ are projection-invariant, we simply assign a 2 x 2 unit matrix to the remaining parameters
of the covariance matrix $C_{tk}$. Once again the EM algorithm can be applied to allow a SFNM (Eq. (1)) with $K_0$
submodels to be fitted to the data over t-space. In order to obviate the need to store all the incoming observations,
and to change the parameters immediately after each data point, it may be computationally more efficient to apply our
previously developed probabilistic self-organizing map (PSOM), an incremental EM algorithm,7 to estimate $p(t)$.
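
A batch version of the two-stage EM iteration can be sketched as follows, assuming NumPy. It is not the incremental PSOM described in the text. Stopping on the change in log-likelihood (rather than on relative entropy), the equal initial mixing proportions, the shared initial covariance, and the small ridge added to each covariance are all assumptions of this sketch.

```python
import numpy as np

def gaussian_pdf(X, mu, C):
    """Multivariate Gaussian kernel evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(C), diff)
    return np.exp(-0.5 * maha) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(C))

def em_sfnm(X, init_means, n_iter=100, tol=1e-8):
    """Fit an SFNM to the rows of X, starting from user-chosen cluster centres."""
    N, d = X.shape
    K = len(init_means)
    pi = np.full(K, 1.0 / K)                                   # assumed equal start
    mu = np.array(init_means, dtype=float)
    C = np.array([np.cov(X, rowvar=False) for _ in range(K)])  # assumed shared start
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probabilities under the "old" parameters
        joint = np.column_stack([pi[k] * gaussian_pdf(X, mu[k], C[k]) for k in range(K)])
        mix = joint.sum(axis=1)
        Z = joint / mix[:, None]
        # M-step: "new" mixing proportions, means and covariances
        Nk = Z.sum(axis=0)
        pi = Nk / N
        mu = (Z.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            C[k] = (Z[:, k, None] * diff).T @ diff / Nk[k] + 1e-9 * np.eye(d)
        loglik = np.log(mix).sum()   # log-likelihood of the parameters used in this E-step
        if loglik - prev < tol:
            break
        prev = loglik
    return pi, mu, C, Z, loglik
```

In the interactive setting described above, `init_means` would hold the cluster centres the user clicks on the top-level plot.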

With a soft partitioning of the data set using the PSOM, data points will now effectively belong to more than one
cluster at any given level. Thus, the effective input values are $t_{ik} = z_{ik}(t_i - \mu_{tk})$ for an independent visualization
subspace $k$ in the hierarchy. We then extend our APEX algorithm to a probabilistic version, i.e., PAPEX,8,27 to
determine $W_k$, summarized as follows (Step 4).


1. Initialize the feedforward weight vector $w_{mk}$ for $m = 1, 2$, and the feedback weight $a_k$, to small random
values at time $i = 1$. Assign a small positive value to the learning rate parameter $\eta$.

2. Set $m = 1$, and for $i = 1, 2, \ldots$, compute

$$y_{1k}(i) = w_{1k}^T(i) \, z_{ik}(t_i - \mu_{tk}), \qquad w_{1k}(i+1) = w_{1k}(i) + \eta \left[ y_{1k}(i) \, z_{ik}(t_i - \mu_{tk}) - y_{1k}^2(i) \, w_{1k}(i) \right] \tag{12}$$

For large $i$ we have $w_{1k}(i) \to w_{1k}$, where $w_{1k}$ is the eigenvector associated with the largest eigenvalue of the
covariance matrix $C_{tk}$.

3. Set $m = 2$, and for $i = 1, 2, \ldots$, compute

$$y_{2k}(i) = w_{2k}^T(i) \, z_{ik}(t_i - \mu_{tk}) + a_k(i) \, y_{1k}(i), \qquad w_{2k}(i+1) = w_{2k}(i) + \eta \left[ y_{2k}(i) \, z_{ik}(t_i - \mu_{tk}) - y_{2k}^2(i) \, w_{2k}(i) \right] \tag{13}$$

$$a_k(i+1) = a_k(i) - \eta \left[ y_{2k}(i) \, y_{1k}(i) + y_{2k}^2(i) \, a_k(i) \right] \tag{14}$$

For large $i$ we have $w_{2k}(i) \to w_{2k}$, where $w_{2k}$ is the eigenvector associated with the second largest eigenvalue
of the covariance matrix $C_{tk}$.
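
The update rules in Eqs. (12)-(14) can be sketched as a small online loop, assuming NumPy. The rows of `inputs` play the role of the weighted inputs $z_{ik}(t_i - \mu_{tk})$; the learning rate, the number of passes over the data, and holding $w_{1k}$ fixed while training $w_{2k}$ are assumptions of this sketch, not values from the text.

```python
import numpy as np

def papex_two_axes(inputs, eta=0.01, epochs=50, seed=0):
    """Estimate the two dominant eigenvectors from weighted inputs (rows of `inputs`)."""
    rng = np.random.default_rng(seed)
    d = inputs.shape[1]
    w1 = rng.normal(scale=0.1, size=d)   # small random feedforward weights
    w2 = rng.normal(scale=0.1, size=d)
    a = 0.0                              # feedback weight (started at zero here)
    # m = 1: Hebbian update with normalization, as in Eq. (12)
    for _ in range(epochs):
        for t in inputs:
            y1 = w1 @ t
            w1 = w1 + eta * (y1 * t - y1 ** 2 * w1)
    # m = 2: Eqs. (13) and (14), with w1 held fixed
    for _ in range(epochs):
        for t in inputs:
            y1 = w1 @ t
            y2 = w2 @ t + a * y1
            w2 = w2 + eta * (y2 * t - y2 ** 2 * w2)
            a = a - eta * (y2 * y1 + y2 ** 2 * a)
    return w1, w2
```

With a constant learning rate the weights fluctuate around the eigenvectors rather than converging exactly, so a small or decaying `eta` is the usual choice.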


Having determined the principal axes $W_k$ of the mixture model at the second level, we will construct the visualization
subspaces by plotting each data point $t_i$ at the corresponding $x_{ik}$. Thus if one particular point takes most of the
contribution for a particular component, then that point will effectively be visible only on the corresponding subspace.


Determination of the parameters of the models at the third level can again be viewed as a two-step estimation
problem, in which a further split of the models at the second level is determined within each of the subspaces over
x-space, and then the parameters of the selected models are fine tuned over t-space. Similarly, the resulting models
estimated over x-space are then used to initialize the means of the respective submodels over t-space. The
corresponding $Q_k(t)$ can again be estimated using the EM or PSOM algorithm7,19,26 to allow a SFNM distribution
with $L_{k,0}$ submodels to be fitted to the data. In the E-step, the posterior probability that data point $t_i$ belongs to
submodel $j$ is given by


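$$z_{i(k,j)} = \frac{\pi_{j|k} \, g(t_i \mid \theta_{t(k,j)})}{\sum_{j'} \pi_{j'|k} \, g(t_i \mid \theta_{t(k,j')})} \, z_{ik} \tag{15}$$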


where  zik  are  constants  estimated  from  the  second  level  of  the  hierarchy.  The  corresponding  M-step  includes 


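$$\pi_{j|k} = \frac{\sum_{i=1}^{N} z_{i(k,j)}}{\sum_{i=1}^{N} z_{ik}}, \qquad
\mu_{t(k,j)} = \frac{\sum_{i=1}^{N} z_{i(k,j)} \, t_i}{\sum_{i=1}^{N} z_{i(k,j)}}, \qquad
C_{t(k,j)} = \frac{\sum_{i=1}^{N} z_{i(k,j)} (t_i - \mu_{t(k,j)})(t_i - \mu_{t(k,j)})^T}{\sum_{i=1}^{N} z_{i(k,j)}} \tag{16}$$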
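
The third-level update can be sketched as below, assuming NumPy and assuming the usual hierarchical-mixture form in which each parent responsibility $z_{ik}$ is shared among the children of cluster $k$ in proportion to $\pi_{j|k} \, g(t_i \mid \theta_{t(k,j)})$. The function names and array layout are hypothetical.

```python
import numpy as np

def child_posteriors(z_k, child_joint):
    """Split parent responsibilities z_ik among the children of cluster k.

    child_joint[i, j] holds pi_{j|k} * g(t_i | theta_{t(k,j)}).
    """
    share = child_joint / child_joint.sum(axis=1, keepdims=True)
    return z_k[:, None] * share

def child_m_step(T, z_k, z_kj):
    """Conditional mixing proportions, means and covariances for the children of cluster k."""
    Nj = z_kj.sum(axis=0)
    pi_j_given_k = Nj / z_k.sum()
    means = (z_kj.T @ T) / Nj[:, None]
    covs = []
    for j in range(z_kj.shape[1]):
        diff = T - means[j]
        covs.append((z_kj[:, j, None] * diff).T @ diff / Nj[j])
    return pi_j_given_k, means, np.array(covs)
```

By construction the child responsibilities of each point sum to its parent responsibility, so the second-level partition is preserved while it is refined.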


With the resulting $z_{i(k,j)}$ in t-space, we can apply the PAPEX algorithm to estimate $W_{(k,j)}$, in which the effective
input values are expressed by $t_{i(k,j)} = z_{i(k,j)}(t_i - \mu_{t(k,j)})$. The next level visualization subspace is generated by
plotting each data point $t_i$ at the corresponding $x_{i(k,j)} = z_{i(k,j)} W_{(k,j)}^T (t_i - \mu_{t(k,j)})$ in $(k, j)$-subspace (Step 6).

The  construction  of  the  entire  tree  structure  hierarchy  is  automatically  completed  when  no  further  data  split  is 
recommended  by  the  information  theoretic  criteria  in  all  of  the  parent  subspaces  (Step  7). 


5.  ILLUSTRATION  AND  APPLICATION 

We first illustrate the application of our algorithm to a simple synthetic data set. Fig. 1 (a) shows a data set
consisting of 450 data points generated from a mixture of three Gaussians in three-dimensional space. Each Gaussian
is relatively flat (has small variance) in one dimension. Two of these pancake-like clusters are closely spaced, while
the third is well separated from the first two. The dimensionality of this data set has been chosen to illustrate the
basic principle of the approach. The global view of the raw data over t-space clearly suggests the presence of three
distinct clusters within the data.
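
A data set of this kind can be generated with a few lines of NumPy. The cluster centres and standard deviations below are illustrative guesses chosen to match the verbal description (three flat Gaussians, two close together, one far away); they are not the values used to produce Fig. 1.

```python
import numpy as np

def make_pancake_data(n_per_cluster=150, seed=0):
    """Three flat Gaussian clusters in 3-D; two close together, one far away."""
    rng = np.random.default_rng(seed)
    # Centres and per-axis standard deviations are illustrative, not the paper's values.
    centres = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [6.0, 6.0, 0.0]])
    scales = np.array([[1.0, 1.0, 0.1], [1.0, 1.0, 0.1], [1.0, 0.1, 1.0]])
    points = [rng.normal(c, s, size=(n_per_cluster, 3)) for c, s in zip(centres, scales)]
    labels = np.repeat(np.arange(3), n_per_cluster)
    return np.vstack(points), labels
```

With the default of 150 points per cluster this yields 450 points in total, the size quoted in the text.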

To  explore  the  data  characteristics,  we  first  perform  a  single  global  PCA  to  project  each  data  point  onto  a  single 
x-space  (top  level),  shown  in  Fig.  1  (b).  Both  the  user  inspection  and  the  two  information  theoretic  criteria  have 
clearly  suggested  the  presence  of  two  distinct  clusters  within  the  projected  data  set.  Based  on  a  soft  clustering  of  the 
data  points,  we  then  apply  PAPEX  to  both  clusters  and  generate  the  two  corresponding  independent  cluster-focused 
subspaces (second level), as shown in Fig. 1 (c). As expected, the two information theoretic criteria have
suggested a further split of cluster 2 but not of cluster 1. Once again, by performing three independent PAPEX runs, the
final cluster decomposition through the cluster-focused subspaces (third level) is completed, as shown in Fig. 1 (d).
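
A data set of this general shape, together with its top-level projection, can be reproduced with a short sketch. The cluster centers, variances, and flat axes below are hypothetical (the text does not give the actual parameters), and a batch SVD is used in place of the adaptive extraction described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def pancake(center, flat_axis, n=150):
    """Sample a 3-D Gaussian that is flat (small variance) along one axis."""
    std = np.ones(3)
    std[flat_axis] = 0.1                     # hypothetical flat-axis spread
    return np.asarray(center) + rng.normal(size=(n, 3)) * std

# Three clusters of 150 points each (450 in total); two are closely
# spaced and the third is well separated. All centers are assumptions.
t = np.vstack([
    pancake([0.0, 0.0, 0.0], flat_axis=0),
    pancake([6.0, 0.0, 0.0], flat_axis=1),
    pancake([6.0, 0.0, 2.0], flat_axis=2),
])

# Global PCA via SVD of the centered data: rows of vt are principal axes,
# ordered by decreasing singular value.
centered = t - t.mean(axis=0)
_, s, vt = np.linalg.svd(centered, full_matrices=False)
x = centered @ vt[:2].T                      # top-level 2-D x-space coordinates
```

Plotting `x` would give the top-level view; the closely spaced pair of clusters is the part a lower level of the hierarchy is meant to resolve.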

With this three-level hierarchical data exploration, the capability of the approach is evident, as the two interim
subspaces (second level) only attempt to highlight the data points which have already been modeled by their
immediate ancestor (top level). Indeed, the model fitting procedure has successfully discovered all three data clusters.
The original data clusters have been individually colored, and it can be seen that the red, yellow, and blue data
points have been well separated and highlighted in the third level subspaces.

As  an  example  of  a  more  complex  problem,  we  consider  a  data  set  arising  from  a  mixture  of  three  closely  spaced 
Gaussians  consisting  of  300  data  points,  shown  in  Fig.  2  (a).  Once  again  the  original  data  clusters  have  been 
individually  colored.  We  first  apply  APEX  to  extract  the  global  principal  axis,  indicated  by  the  black  line  in  Fig.  2 
(a). The two information theoretic criteria have suggested the presence of three distinct clusters; the user then
selects three initial cluster centers and the EM/PSOM algorithm is applied to perform a soft clustering of the data
points. This leads to a mixture of three independent probabilistic principal component subspaces whose principal
axes are separately extracted, indicated by the yellow lines in Fig. 2 (a). The contributions of each data point to
these subspaces, in terms of its “gray-level” $A[\mathbf{t}_i] = z_{ik}$, are displayed over t-space in Fig. 2 (b).

Since the model selection and algorithm initialization are performed over x-space with the user’s interaction, it may be
helpful to investigate the visual effectiveness of dimensionality reduction using the probabilistic principal component
projections.4,9 Based on the estimated $\mathbf{W}_k$, we have constructed each of the cluster-focused subspaces using both
“data graphics” (e.g., in terms of $\mathbf{x}_{ik} = \mathbf{W}_k^T(\mathbf{t}_i - \boldsymbol{\mu}_{tk})$) and “data image” (e.g., in terms of $A[\mathbf{W}_k^T(\mathbf{t}_i - \boldsymbol{\mu}_{tk})] = z_{ik}$)
techniques. As a more overlapped case, Fig. 2 (c-d) presents the plots of “data graphics” and “data image” from the
data  set,  where  “data  graphics”  emphasizes  the  contribution  of  a  particular  data  point  to  that  particular  subspace 
concerning  its  geometric  distance  to  the  center  of  the  cluster,  while  “data  image”  emphasizes  the  effectiveness  of  a 
data  point  reflecting  its  global  appearance.  It  can  be  seen  that  the  plot  of  each  cluster  is  clean  and  well-shaped. 

In  order  to  quantitatively  evaluate  the  effectiveness  of  our  approach  with  user  interactions,9  we  apply  our 
algorithm to a synthesized testing data set given in Fig. 3 (upper left). Using the APEX algorithm we accurately
estimate the top global principal axis, indicated by the black line. By projecting the data points onto a
two-dimensional x-space, all three data clusters are visible. This plot indicates that although the second advantage of
PCA mentioned above is highly data-dependent, when the data clusters are linearly separable in a projection space,
the principal component projections allow effective separation of the clusters.16 We then apply the two information
theoretic criteria to examine this plot. In this case, we set $K_{\min} = 1$ and $K_{\max} = 5$. The minima of both AIC
and  MDL  have  clearly  suggested  a  three-cluster  data  structure,  as  given  by  the  curve  in  Fig.  3  (third  block  in  the 
second  row).  Thus  a  two-level  SFNM  model  may  be  sufficient.  We  then  conduct  two  experiments  to  assess  the 
performance  of  our  algorithm.  Since  all  the  model  parameters  are  known  in  this  case,  the  true  top  principal  axes  of 
the  data  clusters  have  been  individually  calculated.  First,  we  compare  the  estimated  top  principal  axes  of  the  data 
clusters using our algorithm with the corresponding true top principal axes. From the lower-right block in Fig. 3, it
can  be  seen  that  the  two  sets  of  the  top  principal  axes  are  perfectly  matched  (blue  lines).  Second,  we  use  the  global 
relative  entropy  (GRE)  between  the  data  histogram  and  the  estimated  SFNM  model  to  measure  the  goodness  of 
model  fitting.  The  numerical  result  through  our  experiments  indicates  a  very  good  performance  with  a  GRE  value 
of  0.008  nats. 
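
The model selection and goodness-of-fit steps described above can be illustrated with a small one-dimensional sketch. It rests on several assumptions not stated in this section: the criteria are taken in their common forms, AIC = -2 log L + 2 K_a and MDL = -log L + 0.5 K_a log N, with K_a the number of free parameters; the relative entropy is computed between a normalized histogram and the fitted density evaluated at bin centers; and the data, EM routine, and function names are all hypothetical.

```python
import numpy as np

def mixture_density(x, pi, mu, var):
    """Density of a 1-D normal mixture evaluated at each element of x."""
    comp = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    return comp @ pi

def fit_mixture_1d(x, k, n_iter=200):
    """Plain EM for a k-component 1-D normal mixture (quantile initialization)."""
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        comp = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
        z = comp / comp.sum(axis=1, keepdims=True)    # E-step: responsibilities
        nk = z.sum(axis=0)                            # M-step: weighted updates
        pi = nk / x.size
        mu = (z * x[:, None]).sum(axis=0) / nk
        var = (z * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    loglik = np.log(mixture_density(x, pi, mu, var)).sum()
    return loglik, pi, mu, var

def aic(loglik, n_params):
    return -2.0 * loglik + 2.0 * n_params

def mdl(loglik, n_params, n):
    return -loglik + 0.5 * n_params * np.log(n)

def relative_entropy(x, pi, mu, var, bins=30):
    """Relative entropy (in nats) between the data histogram and the model."""
    counts, edges = np.histogram(x, bins=bins)
    p = counts / counts.sum()
    centers = 0.5 * (edges[:-1] + edges[1:])
    q = mixture_density(centers, pi, mu, var)
    q = q / q.sum()                                   # normalize over the bins
    mask = p > 0                                      # empty bins contribute zero
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical test data: three well-separated 1-D clusters.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(m, 1.0, 200) for m in (-10.0, 0.0, 10.0)])

scores = {}
for k in range(1, 6):                                 # K_min = 1, K_max = 5
    loglik, pi, mu, var = fit_mixture_1d(x, k)
    n_params = 3 * k - 1                              # k means, k variances, k-1 proportions
    scores[k] = (aic(loglik, n_params),
                 mdl(loglik, n_params, x.size),
                 relative_entropy(x, pi, mu, var))
```

For clusters this well separated, one would expect both criteria to favor a three-component model over one or two components, and the relative entropy of the selected model to be small.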

User interaction with the algorithm is an important issue. We have developed a user-friendly graphical interface
to facilitate data visualization, as shown in Fig. 3. By allowing the user to select the initial centers
of the data clusters, as demonstrated in Fig. 3, our experience has indicated a great reduction in both
computational complexity and the likelihood of converging to a local optimum. For example, compared to the results of model selection
reported by Akaike21 and Wax,25 the curves of the AIC and MDL generated by our algorithm are much more
consistent and smooth, and user-initialized computation is five times (on average) faster than the random trials. It
should be pointed out that although the final SFNM model can be estimated, there may be multiple pathways to achieving cluster
decomposition. For example, in this case the user has the flexibility to select only two clusters in
the second level and to further split the “right” cluster, thus adopting a three-level hierarchy. We believe that this
user-driven nature of the current algorithm is also highly appropriate for the visualization context.4,14

Since  a  more  convincing  example  should  involve  more  clusters  with  multiple  levels,  we  have  also  applied  our 
algorithm to the same data set used by Bishop and Tipping,4 shown in Fig. 4 (a). This data set arises from a
noninvasive  monitoring  system  used  to  determine  the  quantity  of  oil  in  a  multiphase  pipeline  containing  a  mixture 
of  oil,  water,  and  gas.4  The  experiment  gives  12  diagnostic  measurements  in  total.  Our  interim  goal  is  to  visualize 
the  structure  of  the  data  in  the  original  12-dimensional  space.  A  data  set  consisting  of  1,000  points  is  obtained 
synthetically  and  the  data  is  expected  to  have  an  intrinsic  dimensionality  of  two  corresponding  to  the  two  dominant 
components  (e.g.,  oil  and  water).  However,  the  presence  of  different  flow  configurations  leads  to  numerous  distinct 
clusters.  We  then  apply  our  algorithm  to  perform  a  cluster  discovery.  Results  from  partially  fitting  the  oil  flow  data 
using  a  three-level  hierarchical  model  are  given  in  Fig.  4.  It  should  be  pointed  out  that  since  the  “right”  answer  to 
this real-world data set is not available, we are not able to validate this new result. However, we believe that this
example has been successful: note how the selected single cluster (number 2) in the top-level plot is
discovered to be two well-separated clusters at the second level.

As  a  final  example,  we  consider  the  visual  explanation  in  computer-aided  diagnosis  (CAD)  for  breast  cancer 
detection. As a step toward improving the performance of CAD systems, we have put considerable effort into conducting
various studies and developing reliable image enhancement and lesion segmentation techniques.7 More precisely, we try
to make both the hidden data patterns and the neural network “black box” as transparent as possible to the
user (e.g., radiologists and patients) through interactive visual explanation. The clinical goal is to eliminate the false
positive sites that correspond to normal dense tissues with mass-like appearances through feature discrimination.
We adopt a mathematical feature extraction procedure to construct our database from all the suspicious mass sites
localized by the enhanced segmentation.7 The optimal mapping of the data points is then obtained by learning the
generalized normal mixtures and decision boundaries, where a probabilistic modular neural network is developed to
carry out both soft and hard clustering.7 The joint histograms of the feature database extracted from true and false
mass regions are investigated and the features that can better separate the true and false mass sites are selected.7
Our experience has suggested that three image features, i.e., site area, compactness, and difference entropy,
have good discrimination and reliability properties.

We  then  use  our  previously  developed  algorithm7  to  distinguish  the  true  masses  from  false  masses  based  on  the 
features  extracted  from  the  suspected  regions.  150  mammograms  were  selected  from  the  mammogram  database. 
Each  mammogram  contained  at  least  one  mass  case  of  varying  size  and  location.  The  areas  of  suspicious  masses  were 
identified  following  the  proposed  procedure  with  biopsy  proven  results.  In  a  typical  experiment,  we  have  selected 
a  three-dimensional  feature  space  consisting  of  compactness  I,  compactness  II,  and  difference  entropy.  It  should 
be noted that the feature vector can easily be extended to higher dimensionality. A training feature vector set was
constructed from 50 true mass ROIs and 50 false mass ROIs, where ROI stands for region of interest. In addition
to the decision boundaries recommended by the computer algorithms, a visual explanation interface has also been
integrated with hierarchical projections. Fig. 5 (a) shows the database map selection with compactness definition I
and difference entropy. Fig. 5 (b) shows the database map selection with compactness definition II and difference
entropy. Our experience has suggested that the recognition rate with compactness I is more reliable than that with
compactness  II. 

We  have  conducted  a  preliminary  study  to  evaluate  the  performance  of  the  algorithms  in  real  case  detection,  in 
which 6 to 15 suspected masses per mammogram were detected and required further clinical decision making. We
found that the proposed visual explanation approach, together with the CAD system, can reduce the number of suspicious
masses with a sensitivity of 84% at a specificity of 82% (1.6 false positive findings per mammogram) based on the
database containing 46 mammograms (23 of them have biopsy proven masses). Fig. 6 shows a representative mass
detection result on one mammogram with a stellate mass, indicated by the arrow in Fig. 6 (a). After appropriate
feature extraction, the ten sites with the brightest intensity were selected, shown in Fig. 6 (b). The feature vectors of these
candidates were submitted against the estimated “probability cloud” for visual explanation as a decision support,
together with the opinion recommended by our CAD system. The final results indicated that the stellate mass lesion
was correctly detected, as confirmed by our experienced radiologists and shown in Fig. 6 (c). It should be pointed out that
in this real-world application, the recognition rate can be adjusted by the domain experts in balancing the
trade-off  between  the  false  positive  and  false  negative  rates.7 

6.  DISCUSSION 

We  have  presented  a  novel  approach  to  visual  explanation  for  data  mining  and  knowledge  discovery,  which  is  both 
statistically  principled  and  visually  effective.  This  method,  as  illustrated  by  the  well-planned  simulations  and  pilot 
applications  in  computer-aided  diagnosis,  can  be  very  capable  of  revealing  hidden  structure  within  data.  It  is 
important to emphasize that in relation to previous work,4,11-13 one interesting consideration with the present
algorithm is that the models are determined by the information theoretic criteria, which can not only
select the most appropriate model structure but also allow a user-driven portfolio as a double check. This approach
promotes a self-consistent fitting of the whole tree, so that an automated procedure for generating the hierarchy
becomes reality.4 In addition, since we perform model selection and parameter initialization first over the projection
space, the computational complexity is greatly reduced compared to maximum likelihood estimation in full
dimension. Our case study of a seven-dimensional data set has indicated at least a 50% reduction in computational
time. Other possible advantages include the determination of the data projection by maximizing the separation of clusters,
which in turn optimizes the other crucial operations such as model selection and parameter initialization,16 and
data rendering algorithms which permit user- or hypothesis-driven data projection.14